Country-specific reference values for PROMIS® pain, physical function and participation measures compared to US reference values

Abstract

Introduction: Patient-Reported Outcomes Measurement Information System (PROMIS®) is commonly used across medical conditions. To facilitate interpretation of scores across countries, we calculated Dutch reference values for PROMIS Physical Function (PROMIS-PF), Pain Interference (PROMIS-PI), Pain Behavior (PROMIS-PB), Ability to Participate in Social Roles and Activities (PROMIS-APSRA), and Satisfaction with Social Roles and Activities (PROMIS-SSRA), as compared to US reference values.

Patients and methods: A panel completed full PROMIS-PF (n=1310), PROMIS-PI and PROMIS-PB (n=1052), and PROMIS-APSRA and PROMIS-SSRA (n=1002) item banks and reported their level of health per domain (no, mild, moderate, severe limitations). T-scores were calculated by sample and subgroups (age, gender, self-reported level of domain). Distribution-based and anchor-based thresholds for mild, moderate, and severe scores were determined.

Results: Mean T-scores were close to the US mean of 50 for PROMIS-PF (49.8) and PROMIS-APSRA (50.6), lower for PROMIS-SSRA (47.5) and higher for PROMIS-PI (54.9) and PROMIS-PB (52.0). Distribution-based thresholds for mild, moderate, and severe scores were comparable to US recommended cut-off values (except for PROMIS-PI) but participants reported limitations ‘earlier’ than suggested thresholds.

Conclusion: Dutch reference values were close to US reference values for some PROMIS domains but not all. We recommend country-specific reference values to facilitate worldwide PROMIS use.

KEY MESSAGES
PROMIS offers universally applicable IRT-based efficient and patient-friendly measures to assess commonly relevant patient-reported outcomes across medical conditions.
To support the use of PROMIS in daily clinical practice and research across the world, country-specific general population reference values should be obtained.
More research is necessary to obtain reliable and valid cut-off values for what constitutes mild, moderate and severe scores from the patients’ perspective.


Introduction
Patient-Reported Outcome Measures (PROMs) are increasingly used for outcome measurement in clinical practice to facilitate value-based health care. There is evidence that the routine use of PROMs can lead to better patient-clinician communication, increased discussion of psychosocial issues and improved shared decision-making [1][2][3]. In addition, beneficial effects of routine PROM use have been found on symptom control, quality of life outcomes, patient satisfaction and even survival [2][4][5][6][7][8][9][10] as well as on health care expenditure [11][12][13].
The beneficial effects of the routine use of PROMs can only be obtained if PROMs are successfully implemented in daily clinical care [14][15][16]. However, there are many implementation barriers. Important ones are the selection of patient-reported outcomes (PROs) that are most relevant for patients and the selection of the most suitable PROMs to measure these PROs. A quite common approach is to use disease-specific PROMs because it is assumed that these PROMs are most relevant for the patient group at issue and most responsive to their treatment. However, implementing disease-specific PROMs in daily clinical practice is doomed to fail, for three reasons. First, it is too time-consuming and too costly to implement disease-specific PROMs in electronic health records for every patient group. Second, it is too complex for clinicians and patients to interpret and discuss PROMs with different scales and different cut-off values. Finally, it is too burdensome to ask an increasing number of patients with multiple conditions to complete multiple disease-specific PROMs, often with overlapping content.
Successful implementation of PROMs in routine clinical practice requires a shift towards measuring generic PROs with generic PROMs as much as possible, supplemented with disease-specific PROMs only for outcomes that are really disease-specific, such as disease-specific symptoms. Two research findings show that such a shift is possible. First, it has been shown that PROs that matter most to patients are common across conditions [17][18][19]. Examples of commonly relevant outcomes are physical function, pain and participation. Second, it has been shown that generic PROMs developed within the modern framework of item response theory (IRT) [20,21] can have equal or even better responsiveness than traditional generic PROMs developed within the framework of classical test theory [22][23][24][25][26][27][28], especially when they are used as a computerized adaptive test (CAT), where the computer selects relevant questions based on answers to previous questions [29,30].
The Patient-Reported Outcomes Measurement Information System (PROMIS®) initiative has developed IRT-based PROMs to measure commonly relevant outcomes such as physical function, pain, fatigue, sleep disturbances, anxiety, depression and the ability to participate in social roles and activities. These PROMs are applicable to adults and children with or without (chronic) diseases [31][32][33]. PROMIS measures can be administered as fixed short forms or CAT. Evidence for sufficient psychometric properties across patient populations is growing [34][35][36][37][38][39][40]. PROMIS measures have been translated into more than 60 languages and are increasingly used across countries [41]. For example, Dutch-Flemish translations of PROMIS measures are available for more than 30 domains and have been validated in different populations [42][43][44][45][46][47][48][49][50][51][52]. PROMIS has recently been recommended as the preferred measurement system for assessing commonly relevant PROs in Dutch daily medical specialty care across patient conditions [53].
To support the use of PROMIS in daily clinical practice and research, reference values from the general population are useful. Most PROMIS measures were centered to have a mean of 50 and an SD of 10 in the US general population. However, the health of populations may differ between countries, so it is useful to assess to what extent reference values are similar across countries. Therefore, we aimed to obtain general population-based Dutch reference values for five PROMIS domains (Physical Function, Pain Interference, Pain Behavior, Ability to Participate in Social Roles and Activities, and Satisfaction with Social Roles and Activities) and to compare them with US reference values.

Study participants
A data collection company (Desan Research Solutions) recruited three waves of at least 1000 people from the Dutch general population from an existing internet panel in 2016. The panel was provided by Global Market Insite (GMI). Informed consent to become a panelist was obtained by GMI. Panelists were recruited by an invitation from the panel host to participate. By voluntarily responding to the invitation for this survey, panelists provided informed consent to participate in the study. More details about the panel are provided by Elsman et al. [54]. The study samples were selected to be representative of the Dutch general population with respect to age distribution (18-40; 40-65; >65), gender, educational level (low, middle, high), region of residence (north, east, south, west) and ethnicity (native Dutch, first- and second-generation western immigrant, first- and second-generation non-western immigrant).

Procedures
A web-based survey was used, in which skipping items was not allowed. Participants were asked to complete an online questionnaire once. In Wave 1 participants completed the full v1.2 PROMIS Physical Function item bank, in Wave 2 participants completed the full v1.1 PROMIS Pain Interference and v1.1 Pain Behavior item banks, and in Wave 3 participants completed the full v2.0 PROMIS Ability to Participate in Social Roles and Activities and Satisfaction with Social Roles and Activities item banks. Additionally, participants were asked to describe their level of health for each domain on a single item, described below. Afterwards, participants completed questions regarding sociodemographic characteristics (age, gender, education, region of residence and ethnicity). The Medical Ethical Committee of Amsterdam UMC, location VUmc, the Netherlands, confirmed that the study protocol was exempted from ethical approval according to the Dutch Medical Research in Human Subjects Act (WMO), as no experiments were conducted.

Measures
The PROMIS v1.2 Physical Function item bank contains 121 items, measuring the ability to perform activities including upper extremities (dexterity), lower extremities (walking or mobility) and central regions (neck, back), as well as the ability to perform instrumental activities of daily living, such as running errands. The PROMIS v1.1 Pain Interference item bank contains 40 items referring to the self-reported consequences of pain on relevant aspects of one's life, including the extent to which pain hinders engagement with social, cognitive, emotional, physical and recreational activities. The PROMIS v1.1 Pain Behavior item bank contains 39 items referring to verbal or non-verbal and involuntary or deliberate self-reported external manifestations of pain: behaviors that typically indicate to others that an individual is experiencing pain. The PROMIS v2.0 Ability to Participate in Social Roles and Activities item bank contains 35 items measuring the perceived ability to perform one's usual social roles and activities. The PROMIS v2.0 Satisfaction with Social Roles and Activities item bank contains 44 items measuring satisfaction with performing one's usual social roles and activities. The Physical Function, Pain Interference and both Participation item banks use five response options. The Pain Behavior item bank uses six response options (including the option 'had no pain'). The Physical Function item bank and both Participation item banks have no time frame. The Pain item banks use the past 7 days as a time frame. All item banks are scored on a T-score metric, which has an average of 50 and standard deviation (SD) of 10 in the US general population. Higher scores indicate more of the construct being assessed. For example, higher Physical Function scores indicate better physical function, demonstrating good health, whereas higher Pain Interference scores indicate more pain interference, representing poor health.
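To make the T-score metric concrete, the sketch below shows the linear rescaling that relates a standardized trait estimate (theta, expressed in US general population SD units) to a T-score with mean 50 and SD 10. It is an illustration only: the function names are hypothetical, and actual PROMIS scoring estimates theta from item responses using the calibrated item parameters, which this sketch does not attempt.

```python
# Minimal sketch of the T-score metric: mean 50, SD 10 in the reference population.
# Assumption: theta is already estimated and expressed in reference-population SD units.

US_MEAN = 50.0  # T-score mean in the US general population
US_SD = 10.0    # T-score SD in the US general population


def theta_to_t(theta: float) -> float:
    """Rescale a standardized trait estimate to the T-score metric."""
    return US_MEAN + US_SD * theta


def t_to_theta(t_score: float) -> float:
    """Express a T-score as a distance from the reference mean in SD units."""
    return (t_score - US_MEAN) / US_SD


if __name__ == "__main__":
    for theta in (-1.0, 0.0, 0.5):
        print(f"theta={theta:+.1f} -> T={theta_to_t(theta):.1f}")
```

Read this way, a T-score of 40 lies one US SD below the US mean, regardless of whether a lower score represents better or worse health for the domain in question.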
Five single items were used to measure the overall level of the health domains, one item for each domain (physical function, pain interference, pain behavior, ability to participate in social roles and activities and satisfaction with social roles and activities). For example: 'How would you describe your physical function?'. Response options for all five items were: no limitations, mild limitations, moderate limitations and severe limitations.

Statistical analyses
First, we compared the characteristics of the study participants to data from Statistics Netherlands in 2016 [55] to check for a maximum allowable deviation of 2.5% per sociodemographic variable. Second, we compared our data to a US general population sample to ensure that T-scores of comparable Dutch and US populations can be compared without bias. We used PROMIS wave 1 data, obtained from the HealthMeasures Dataverse repository [56], and selected only people from the general population (Physical Function n = 1700, Pain Interference n = 946, Pain Behavior n = 881, Ability to Participate in Social Roles and Activities n = 429, Satisfaction with Social Roles and Activities n = 424). We performed Differential Item Functioning (DIF) analyses, which examine whether Dutch and US people with the same level of a domain have different probabilities of giving a certain response to an item [57]. DIF was assessed by comparing a series of ordinal logistic regression models, using the R package lordif (version 0.3-3) [58]. We used a change in McFadden's pseudo R² of 2% between the models as the criterion for DIF. Uniform DIF exists when the magnitude of the DIF is consistent across the entire range of the trait. Non-uniform DIF exists when the magnitude or direction of DIF differs across the trait. We checked the impact of DIF on total scores by examining test characteristic curves, displaying the difference between the groups when calculating a total raw score based on all items or on items flagged for DIF only.
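The pseudo R² criterion can be illustrated with a short sketch. It assumes the usual nested-model setup for logistic regression DIF (model 1: trait only; model 2: trait plus group; model 3: trait, group and their interaction) and takes the fitted log-likelihoods as given. The function names, the example log-likelihoods and the choice to flag at "greater than or equal to" the 2% cut-off are assumptions for illustration; this is not the lordif implementation.

```python
# Sketch of a McFadden pseudo R-squared change criterion for DIF.
# Assumption: log-likelihoods come from ordinal logistic models fitted elsewhere.


def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only model)."""
    return 1.0 - ll_model / ll_null


def flag_dif(ll_null: float, ll_trait: float, ll_group: float,
             ll_interaction: float, threshold: float = 0.02) -> dict:
    """Compare nested models and flag DIF when the R-squared change reaches the threshold."""
    r2_trait = mcfadden_r2(ll_trait, ll_null)              # model 1: trait only
    r2_group = mcfadden_r2(ll_group, ll_null)              # model 2: + group
    r2_interaction = mcfadden_r2(ll_interaction, ll_null)  # model 3: + trait x group
    return {
        "uniform": (r2_group - r2_trait) >= threshold,
        "non_uniform": (r2_interaction - r2_group) >= threshold,
        "total": (r2_interaction - r2_trait) >= threshold,
    }


if __name__ == "__main__":
    # Hypothetical log-likelihoods for a single item, not taken from the study.
    print(flag_dif(ll_null=-1000.0, ll_trait=-700.0,
                   ll_group=-675.0, ll_interaction=-674.0))
```

In this hypothetical case, adding the group term improves the fit noticeably while the interaction term adds almost nothing, which is the pattern described above as uniform DIF.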
Third, we calculated PROMIS T-scores per item bank from the raw item scores using the online HealthMeasures Scoring Service program, provided by the US Assessment Center [59]. All participants, including people who reported 'had no pain' on the Pain Behavior item bank, were included in the analyses. T-scores were calculated for the entire sample, as well as for subgroups based on age (18-34 years, 35-44 years, 45-54 years, 55-64 years, 65-74 years and ≥75 years), gender and self-reported level of the domain (anchor-based thresholds). We also calculated distribution-based thresholds for mild, moderate and severe T-scores based on 0.5 × SD, 1 × SD and 2 × SD below (for constructs indicating good health) or above (for constructs indicating poor health) the average of the general population, respectively. We compared the mean T-scores of the Dutch and US populations and the subgroups. For the Physical Function and Pain item banks, we used gender and age range sub-norms for adult PROMIS measures centered on the US General Census 2000, presented on the HealthMeasures website [60]. For the Participation item banks, we calculated T-scores using the US PROMIS 1 Social Supplement, obtained from the HealthMeasures Dataverse repository [56]. We selected only the participants from this Supplement who were recruited from the US general population (Polimetrix sample, n = 1008).
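The distribution-based thresholds described above follow directly from a reference mean and SD. The sketch below shows the arithmetic for both score directions, using the US metric (mean 50, SD 10) as the worked example. The function names are hypothetical, and treating a score exactly on a cut-off as belonging to the more severe category is an assumption.

```python
# Sketch of distribution-based thresholds at 0.5, 1 and 2 SD from the reference mean.
from typing import NamedTuple


class Thresholds(NamedTuple):
    mild: float
    moderate: float
    severe: float


def distribution_thresholds(mean: float, sd: float, higher_is_better: bool) -> Thresholds:
    """Cut-offs below the mean for good-health constructs, above it for poor-health ones."""
    sign = -1.0 if higher_is_better else 1.0
    return Thresholds(mean + sign * 0.5 * sd,
                      mean + sign * 1.0 * sd,
                      mean + sign * 2.0 * sd)


def classify(t_score: float, cut: Thresholds, higher_is_better: bool) -> str:
    """Label a T-score; scores exactly on a cut-off fall in the more severe category (assumption)."""
    sign = -1.0 if higher_is_better else 1.0  # after flipping, larger always means worse
    worse = sign * t_score
    if worse >= sign * cut.severe:
        return "severe"
    if worse >= sign * cut.moderate:
        return "moderate"
    if worse >= sign * cut.mild:
        return "mild"
    return "within normal limits"


if __name__ == "__main__":
    pain = distribution_thresholds(50.0, 10.0, higher_is_better=False)
    function = distribution_thresholds(50.0, 10.0, higher_is_better=True)
    print(pain, classify(58.0, pain, higher_is_better=False))
    print(function, classify(42.0, function, higher_is_better=True))
```

Substituting a country-specific mean and SD for the US values shifts all three cut-offs together, which is why a higher population mean yields higher thresholds.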

Study participants
The three waves included 1310 (Physical Function), 1052 (Pain) and 1002 (Participation) participants, respectively. Characteristics of the participants are summarized and compared to the Dutch population in 2016 in Table 1. All differences were less than the 2.5% agreed upon.
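A representativeness check of this kind can be sketched as a comparison of category percentages in the sample against population percentages. The sketch interprets the 2.5% criterion as an absolute difference in percentage points per category, which is an assumption, and the percentages in the example are hypothetical rather than taken from Table 1.

```python
# Sketch of a representativeness check against population percentages.
# Assumption: the tolerance is an absolute difference in percentage points per category.


def deviations(sample_pct: dict, population_pct: dict) -> dict:
    """Absolute difference between sample and population percentages per category."""
    return {key: abs(sample_pct[key] - population_pct[key]) for key in population_pct}


def is_representative(sample_pct: dict, population_pct: dict,
                      tolerance: float = 2.5) -> bool:
    """True when every category deviates by no more than the tolerance."""
    return all(diff <= tolerance
               for diff in deviations(sample_pct, population_pct).values())


if __name__ == "__main__":
    # Hypothetical percentages for illustration only.
    population = {"men": 49.5, "women": 50.5}
    sample = {"men": 49.0, "women": 51.0}
    print(deviations(sample, population), is_representative(sample, population))
```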

Comparability of Dutch and US scores
Two items of the Physical Function item bank and two items of the Pain Behavior item bank were flagged for uniform DIF (Table 2). In both cases, for one item the Dutch population endorses higher item response categories at the same level of the domain than the US population, and for the other item, it was the other way round. The impact of DIF on the total scores was considered negligible. No DIF was found for the other item banks.

Dutch PROMIS reference scores
Mean T-scores for the entire samples, and for age and gender groups, for the five item banks are presented in Tables 3 through 5. Mean T-scores in the Dutch general population were close to the mean T-scores in the US population of 50 for Physical Function (49.8) and Ability to Participate in Social Roles and Activities (50.6). However, the Dutch population showed lower levels of Satisfaction with Social Roles and Activities (47.5) and higher levels of Pain Interference (54.9) and Pain Behavior (52.0) than the US population.
Men had slightly better Physical Function and Participation scores than women (about 2 T-score points and 1 T-score point, respectively), while differences in Pain between men and women were less than 1 point. Physical Function levels were worst in the highest age groups, while Pain and Participation levels were worst in the middle age groups (45-64 years).
Distribution-based thresholds for mild, moderate and severe scores, based on 0.5 × SD, 1.0 × SD and 2.0 × SD below (for constructs indicating good health) or above (for constructs indicating poor health) the average of the general population, were quite similar in the Dutch population to the thresholds suggested for the US population on the HealthMeasures website for Physical Function, Pain Behavior and both Participation item banks (Tables 3-5). For Pain Interference, the thresholds were somewhat higher in the Dutch population than the recommended US values because of the higher mean Pain Interference T-score in the Dutch population. However, anchor-based thresholds, based on mean T-scores for people who self-reported mild, moderate and severe limitations, did not coincide with the distribution-based thresholds (Figures 1-5). Overall, people reported limitations 'earlier' (at lower severity levels) than the distribution-based cut-off values. For all domains, the mean T-scores for people who reported mild symptoms/functional problems would be classified as within normal limits based on SD cut-off values, mean T-scores for people who reported moderate symptoms/functional problems would be classified as mild problems, and mean T-scores for people who reported severe symptoms/functional problems would be classified as moderate problems. However, there was wide variation in T-scores within each self-reported limitations subgroup and wide overlap in T-score ranges between the subgroups.
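The comparison between anchor-based and distribution-based classifications can be sketched as follows: T-scores are grouped by the self-reported limitation level, the group means are computed, and each mean is labeled using SD-based cut-offs. The records below are hypothetical values chosen to mimic the pattern described above for a construct where higher scores indicate poorer health; they are not study data, and the cut-offs shown are the US-metric values of 55, 60 and 70.

```python
# Sketch: anchor-based group means labeled against SD-based cut-offs.
# Assumption: higher scores indicate poorer health (as for the pain item banks).
from collections import defaultdict
from statistics import mean


def anchor_means(records):
    """Mean T-score per self-reported limitation level from (anchor, T-score) pairs."""
    groups = defaultdict(list)
    for anchor, t_score in records:
        groups[anchor].append(t_score)
    return {anchor: mean(scores) for anchor, scores in groups.items()}


def sd_based_label(t_score, cutoffs=(55.0, 60.0, 70.0)):
    """Label a T-score using mild, moderate and severe cut-offs."""
    mild, moderate, severe = cutoffs
    if t_score >= severe:
        return "severe"
    if t_score >= moderate:
        return "moderate"
    if t_score >= mild:
        return "mild"
    return "within normal limits"


if __name__ == "__main__":
    # Hypothetical (anchor, T-score) records for illustration only.
    records = [("no", 48.0), ("no", 50.0), ("mild", 52.0), ("mild", 54.0),
               ("moderate", 57.0), ("moderate", 59.0),
               ("severe", 63.0), ("severe", 67.0)]
    for anchor, group_mean in anchor_means(records).items():
        print(f"self-reported {anchor}: mean T={group_mean:.1f} "
              f"-> {sd_based_label(group_mean)}")
```

With these illustrative numbers, each self-reported level is labeled one category lower by the SD-based cut-offs, which is what reporting limitations 'earlier' means in practice.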

Discussion
This study assessed to what extent general population reference values for interpreting PROMIS T-scores are similar in the Netherlands and the US. Mean T-scores in the Dutch general population were found to be close to the mean T-scores in the US population of 50 for Physical Function (49.8) and Ability to Participate in Social Roles and Activities (50.6). However, the average T-scores in the Dutch population were lower for Satisfaction with Social Roles and Activities (47.5) and higher for Pain Interference (54.9) and Pain Behavior (52.0). Distribution-based thresholds for mild, moderate and severe scores were comparable to the US recommended cut-off values for most item banks (except Pain Interference), but study participants reported limitations 'earlier' than these suggested distribution-based thresholds.
Only two items of the Physical Function item bank and two items of the Pain Behavior item bank were flagged for DIF, and the impact of DIF on T-scores was considered negligible, indicating that T-scores of comparable Dutch and US populations can be compared unbiasedly. These results are consistent with previous studies in clinical populations [40,44,45,62,63].

Table 5. PROMIS Ability to Participate in Social Roles and Activities and Satisfaction with Social Roles and Activities Dutch reference values by age and gender and compared with the US reference population [61].
Two other studies reported mean T-scores in general population samples from the UK, France, Germany and Norway [64,65]. In the UK, France and Germany, slightly higher mean T-scores (about 51-53) were found for Physical Function as compared to the Netherlands (mean T-score 49.8), and lower mean T-scores were found for Pain Interference (about 49-51) as compared to the Netherlands (mean T-score 54.9). The study from Norway also reported a mean T-score of 55.0 for Pain Interference, but a lower score for the Ability to Participate (48.3) as compared to the Netherlands (50.6). However, the Norwegian sample was not representative of the Norwegian general population. These studies and our study suggest that it is useful to obtain country-specific reference values for using PROMIS across countries. However, variables other than country could also be responsible for the differences in T-scores found between countries. For example, the US values are based on data collected in 2000, and the (perception of) population health may have changed over time. An alternative to country-specific reference values could be to base reference values on a multi-national data set. However, it is questionable whether this is achievable and, more importantly, whether a 'world average' would be meaningful.

Colored lines indicate the current recommended Dutch PROMIS distribution-based thresholds (green = within normal limits, yellow = mild, orange = moderate, red = severe symptoms).
The self-reported limitations by the study participants suggest that thresholds based on SDs may not be a valid indicator of what patients consider mild, moderate, or severe problems. Anchor-based thresholds based on patients' opinions are generally considered more valid than distribution-based thresholds [66]. However, the self-reported limitations in this study were based on a single item only and, given the wide variation in T-scores within each self-reported limitations subgroup and the wide overlap in T-score ranges between the subgroups, the validity of the self-reported limitations could be questioned. Previous studies have used a qualitative bookmarking methodology, which includes a ranking of clinical vignettes (i.e. descriptions of health states based on a selection of item responses) by patients or clinicians [67]. Using this method, Bingham et al. found thresholds for Pain Interference of 52, 63 and 72 for mild, moderate and severe Pain Interference, respectively, in RA patients [68]. Cella et al. found comparable thresholds of 50 for mild, 60 for moderate and 70 for severe Pain Interference in oncology patients [69]. We found no studies using this method on the other item banks included in this study. More research is necessary to obtain reliable and valid cut-off values for what constitutes mild, moderate and severe scores from the patients' perspective. For the time being, we recommend using distribution-based thresholds, consistent with the HealthMeasures recommendations. Since our data are representative of the Dutch general population, we recommend using the Dutch distribution-based thresholds obtained in this study in the Netherlands, unless or until there is sound evidence that this is inappropriate. Clinicians and researchers should nevertheless keep in mind that less severe scores may also be considered problematic by patients.
The PROMIS domains addressed in this study are part of the eight PROMIS profile domains, which are considered the most important outcomes across (clinical) populations [70]. Dutch reference scores for the additional PROMIS profile domains Fatigue, Anxiety and Depression, as well as for the PROMIS Global Health Scale are published elsewhere or submitted for publication [54,71,72] and analyses of Dutch reference scores for the Sleep item banks are ongoing.
A strength of this study was the use of large and representative study samples. As indicated above, a limitation of this study was the use of only single items to measure self-reported limitations. Another limitation was that the maximum allowable deviation of 2.5% per sociodemographic variable for comparing the characteristics of the study participants to data from Statistics Netherlands was chosen arbitrarily. We could not find any recommendations for an acceptable deviation from a reference population in the literature. Furthermore, the study was only performed in the Netherlands, while the PROMIS measures are also used in Flanders, the Dutch-speaking part of Belgium. One study investigated DIF for the two pain item banks between Dutch and Flemish RA patients, and found only one item with DIF, with negligible impact [73]. Therefore, the reference values obtained in our study may also be relevant for the Flemish population. However, future studies may be needed to investigate whether population levels of pain, function and participation are similar in the Netherlands and Flanders. A final potential limitation of the study was that the data were collected in 2016, before the COVID-19 pandemic. Current population levels of pain, function and participation may be different. Ideally, reference values should be updated periodically (for example, the Public Health Monitor 2020 of the Dutch Community Health Services, Statistics Netherlands and the National Institute for Public Health and the Environment is updated every four years), but this is dependent upon funding.

Conclusion
This study showed that Dutch general population reference values for interpreting PROMIS T-scores were close to US reference values for some PROMIS domains but not all. We recommend obtaining country-specific reference values for using PROMIS across the world. We also recommend using Dutch distribution-based thresholds for mild, moderate and severe scores in the Netherlands, while keeping in mind that less severe scores may also be considered problematic by patients. More studies are needed to define thresholds based on patients' opinions.